1 Introduction 

AutoBayes is a fully automatic program synthesis system for the statistical 
data analysis domain. Its input is a concise description of a data analysis problem 
in the form of a statistical model; its output is optimized and fully documented 
C/C++ code which can be linked dynamically into the Matlab and Octave
environments. AutoBayes synthesizes code by a schema-guided deductive process. 
Schemas (i.e., code templates with associated semantic constraints) are applied 
to the original problem and recursively to emerging subproblems. AutoBayes 
complements this approach by symbolic computation to derive closed-form solu- 
tions whenever possible. In this paper, we concentrate on the interaction between 
the symbolic computations and the deductive synthesis process; a detailed de- 
scription of AutoBayes can be found in [FSP00,FS01]. 

A statistical model specifies for each problem variable (i.e., data or parame- 
ter) its properties and dependencies in the form of a probability distribution. A 
typical data analysis task is to estimate the best possible parameter values from 
the given observations or measurements. The following example models normal- 
distributed data but takes prior information (e.g., from previous experiments) 
on the data’s mean value and variance into account. 

1 model normal as 'Normal model with conjugate priors'.
2 const double kappa_0, mu_0.
3 where 0 < kappa_0.
4 double mu ~ gauss(mu_0, sqrt(sigma_sq/kappa_0)).
5 const double sigma_0_sq, delta_0.
6 where 0 < sigma_0_sq and 0 < delta_0.
7 double sigma_sq ~ invgamma(delta_0/2+1, sigma_0_sq*(delta_0/2)).
8 const nat n_points.
9 where 0 < n_points.
10 data double x(0..n_points-1) ~ gauss(mu, sqrt(sigma_sq)).
11 max pr({x, mu, sigma_sq}) for {mu, sigma_sq}.

Here, lines 8-10 describe the data properties: x is a vector of n_points real-
valued observations that are independently drawn from a normal or Gaussian
distribution with unknown mean mu and standard deviation sqrt(sigma_sq), i.e.,
variance sigma_sq. Lines 2-4 specify the prior information on mu, which is itself
drawn from a normal distribution. This prior summarizes a number of previous
experiments, where mu turned



r.ions. The application of these schemas is also guided by the network structure 
but they require more substantial symbolic computations. The skeleton of the 
synthesized code is generated by the application of statistical algorithm schemas. 
AutoBayes currently implements two such schemas, the EM algorithm and
k-Means (i.e., nearest neighbor clustering). After this last network-oriented layer, 
the statistical problem has been transformed into an ordinary optimization prob- 
lem. If AutoBayes cannot find a symbolic solution for this problem, it ap- 
plies standard numeric optimization methods. AutoBayes currently provides 
schemas for the Newton-Raphson and Nelder-Mead simplex algorithms. These 
schemas are instantiated with the function to be optimized. In contrast to using 
a library function, this open approach allows further symbolic simplifications 
and optimizations. 
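
As a rough illustration of what it means to instantiate a numeric schema with the function to be optimized, the following Python sketch wraps a Newton-Raphson iteration around one specific objective. Everything in it is an assumption made for illustration: the names `newton_raphson_max`, `d1`, and `d2`, the Bernoulli objective, and the choice of Python (AutoBayes itself emits C/C++). It is not the schema AutoBayes actually uses.

```python
def newton_raphson_max(d1, d2, x0, tol=1e-10, max_iter=50):
    """Locate a stationary point of an objective by Newton-Raphson on its derivative.

    d1 and d2 are the first and second derivatives of the objective.
    """
    x = x0
    for _ in range(max_iter):
        step = d1(x) / d2(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")


# Hypothetical objective: the Bernoulli log-likelihood for k successes in n trials,
# f(p) = k*log(p) + (n - k)*log(1 - p).
k, n = 3, 10
d1 = lambda p: k / p - (n - k) / (1 - p)          # f'(p)
d2 = lambda p: -k / p**2 - (n - k) / (1 - p)**2   # f''(p), negative on (0, 1)

p_hat = newton_raphson_max(d1, d2, x0=0.4)
```

Because the derivatives are spelled out next to the loop rather than hidden behind a library call, they remain available for further simplification, which is the point the paragraph above makes about the open approach.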

Symbolic Subsystem. The main task of this subsystem is to find symbolic 
solutions to optimization problems. This daunting task, however, is simplified 
substantially by the relatively uniform structure of the optimization problems 
which allows powerful heuristics to be implemented. 

At the core of the symbolic subsystem is a small but reasonably efficient 
AC-rewrite engine implemented in Prolog. Since a rewrite system for this en- 
gine is implemented naturally as a Prolog-predicate, conditional rewriting comes 
“for free.” Moreover, the rule clauses can access explicit assumptions; hence,
AutoBayes allows conditional rules such as $x/x \to_{\mathcal{A} \vdash x \neq 0} 1$, where
$\to_{\mathcal{A} \vdash x \neq 0}$ means “rewrites to, provided $x \neq 0$ can be proven from the
current assumptions.” The assumptions are managed almost transparently by the
rewrite engine; the rewrite system only needs to contain the non-congruent
propagation rules which modify the assumptions under which subterms are
rewritten, e.g.,
if p then s else t fi $\to_{\mathcal{A}}$ if p then $s{\downarrow}_{\mathcal{A} \wedge p}$ else $t{\downarrow}_{\mathcal{A} \wedge \neg p}$ fi,
where $t{\downarrow}_{\mathcal{A}}$ is the normal form of t under the assumptions $\mathcal{A}$. 
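
The two kinds of rules described above can be sketched in a few lines of Python. This is a minimal illustration only: the tuple encoding of terms, the function name `normalize`, and the use of plain set membership as a stand-in for "can be proven from the current assumptions" are all assumptions of this sketch, and the actual engine is an AC-rewrite system written in Prolog.

```python
def normalize(term, assumptions=frozenset()):
    """Rewrite a term bottom-up under a set of assumed facts.

    Terms are nested tuples such as ("div", "x", "x"); strings are variables
    and numbers are constants.
    """
    if not isinstance(term, tuple):
        return term  # variables and constants are already in normal form
    op, *args = term
    if op == "if":
        cond, then_t, else_t = args
        # Non-congruent propagation: each branch is rewritten under the
        # assumptions extended by the condition or its negation.
        return ("if", cond,
                normalize(then_t, assumptions | {cond}),
                normalize(else_t, assumptions | {("not", cond)}))
    args = [normalize(a, assumptions) for a in args]
    # Conditional rule: x/x -> 1, provided x != 0 is among the assumptions.
    if op == "div" and args[0] == args[1] and ("neq", args[0], 0) in assumptions:
        return 1
    return (op, *args)


unguarded = normalize(("div", "x", "x"))
guarded = normalize(("if", ("neq", "x", 0), ("div", "x", "x"), 0))
```

Without any assumption the quotient is left untouched; inside the guarded branch the condition `x != 0` is available, so the quotient collapses to 1.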

Expression simplification and symbolic differentiation are implemented on 
top of the rewrite engine. The basic rules are straightforward; however, vec- 
tors and matrices introduce the usual aliasing problems and require careful for- 
malizations. For example, as the index values i and j are usually unknown 
at synthesis time, the partial derivative $\partial x_i / \partial x_j$ can only be rewritten into
if i = j then 1 else 0 fi. More advanced rules, however, require explicit
meta-programming, especially when bound variables are involved. 
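
The aliasing issue for indexed variables can be made concrete with a small Python sketch. The function name `d_elem` and the convention that known indices are integers while unknown ones are strings are assumptions introduced here for illustration; they do not come from AutoBayes.

```python
def d_elem(i, j):
    """Derivative of x[i] with respect to x[j].

    Indices are ints when known at synthesis time and strings when symbolic.
    """
    if isinstance(i, int) and isinstance(j, int):
        return 1 if i == j else 0  # both indices known: decide immediately
    if i == j:
        return 1  # syntactically identical symbolic indices
    # Otherwise the decision has to be deferred to a conditional expression.
    return ("if", ("eq", i, j), 1, 0)
```

For example, `d_elem(2, 3)` can be decided at once, whereas `d_elem("i", "j")` yields the conditional term corresponding to "if i = j then 1 else 0 fi".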

Abstract interpretation is used as an efficient mechanism to evaluate range 
constraints such as $x > 0$ or $x \neq 0$ which occur in the conditions of many rewrite 
rules. AutoBayes implements as a rewrite system a domain-specific refinement 
of the standard sign abstraction where numbers are not only abstracted into pos 
and neg but also into small (i.e., $|x| < 1$) and large. 
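
A minimal Python sketch of the standard sign abstraction may help to show how such range constraints can be evaluated without knowing concrete values. It covers only the sign part (pos, neg, zero, and an "any" value for unknown sign); the small/large refinement is omitted, and the term encoding and all function names are assumptions of this sketch rather than AutoBayes's actual rules.

```python
POS, NEG, ZERO, ANY = "pos", "neg", "zero", "any"


def abstract_const(c):
    """Abstract a concrete number into its sign."""
    return POS if c > 0 else NEG if c < 0 else ZERO


def abs_mul(a, b):
    """Sign of a product of two abstract values."""
    if ZERO in (a, b):
        return ZERO
    if ANY in (a, b):
        return ANY
    return POS if a == b else NEG


def abs_add(a, b):
    """Sign of a sum of two abstract values."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else ANY  # pos + neg has unknown sign


def sign_of(term, env):
    """Evaluate the sign of a term; env maps variable names to abstract signs."""
    if isinstance(term, (int, float)):
        return abstract_const(term)
    if isinstance(term, str):
        return env.get(term, ANY)
    op, lhs, rhs = term  # op is "mul" or "add" in this sketch
    a, b = sign_of(lhs, env), sign_of(rhs, env)
    return abs_mul(a, b) if op == "mul" else abs_add(a, b)


env = {"sigma_sq": POS, "kappa_0": POS}
result = sign_of(("add", ("mul", "sigma_sq", "kappa_0"), 1), env)
```

Here the condition that the term is positive is established purely from the signs of its parts, which is the kind of check a rewrite rule's side condition needs.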

It then turns out that a relatively simple solver built on top of this core 
system is already sufficient. AutoBayes thus essentially relies on a low-order 
polynomial (i.e., linear, quadratic, and simple cubic) symbolic solver. However, it 
also shifts and normalizes exponents, recognizes multiple roots and bi-quadratic 
forms, and tries to find polynomial factors. It also handles expressions in x and 
(1 - x) which are common in Bernoulli models. 
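
To give a feel for how far low-order techniques reach, here is a small Python sketch of a quadratic solver with a linear fallback and a bi-quadratic solver built on it by substitution. It works numerically on coefficients rather than symbolically on terms, and the function names are hypothetical; it illustrates the idea, not the AutoBayes solver.

```python
import math


def solve_quadratic(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, falling back to the linear case."""
    if a == 0:
        return [] if b == 0 else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    if disc == 0:
        return [-b / (2 * a)]  # a double root is reported once
    r = math.sqrt(disc)
    return sorted([(-b - r) / (2 * a), (-b + r) / (2 * a)])


def solve_biquadratic(a, b, c):
    """Real roots of a*x**4 + b*x**2 + c = 0 via the substitution y = x**2."""
    roots = set()
    for y in solve_quadratic(a, b, c):
        if y >= 0:
            s = math.sqrt(y)
            roots.update((s, -s))
    return sorted(roots)


# Bernoulli-style example: k/x - (n - k)/(1 - x) = 0 becomes, after multiplying
# through by x*(1 - x), the linear equation k - n*x = 0.
k, n = 3, 10
bernoulli_root = solve_quadratic(0, -n, k)
```

The Bernoulli example shows why expressions in x and (1 - x) are worth special treatment: once the denominators are cleared, the stationarity condition is linear and the solver returns k/n directly.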



4 Conclusions 


The tight combination of schema-guided synthesis, deduction, and symbolic
computation in AutoBayes is essential to generate efficient code. Symbolic
computation is used for simplification and for finding symbolic solutions if they exist. 
However, we can only synthesize a correct program from a specification when we 
can rely on the soundness of the symbolic machinery. This in particular means 
that all transformations have to be performed with respect to the proper assump- 
tions, like an expression being non-zero. Transformations can also give rise to 
new proof obligations, e.g., showing that a possible solution is the minimum and 
not just a saddle point. AutoBayes keeps track of all assumptions and either 
discharges them during synthesis or generates assertions to be checked during 
runtime. The importance of symbolic calculation under assumptions and the
possible unsoundness of a commercial symbolic algebra system like Mathematica
led us to develop our own symbolic subsystem on top of Prolog. 

Although we have been able to synthesize code for various non-trivial text- 
book examples, AutoBayes’s code generating capabilities for a variety of statis- 
tical models need to be extended substantially. Besides adding further algorithm 
schemas for statistical computations and for general numerical optimization, im- 
provement of the symbolic subsystem is of major importance. The power and 
generality of the equation solver will need to be enhanced. Furthermore, for 
marginalization in statistical models, symbolic handling of (relatively) simple 
integrals is important. Each enhancement in the symbolic subsystem will lead to 
improvement of the synthesized code as more subtasks can be solved in closed 
form rather than being approximated by (slower) numerical algorithms. In all 
cases, AutoBayes ensures correctness of the synthesized code with respect to 
the specification by generating the appropriate runtime assertions and documen- 
tation. 